[WIP] New Zig formal grammar #1685

Hejsil · 2018-10-27T10:54:43Z

This is an attempt at formalizing a Parsing Expression Grammar for the Zig programming language. This is done to find a better solution for #760.

Currently, I have the grammar posted here using the peg parser generator to validate it. The grammar is a breaking change from 0.3.0 (See what changed in db5d479).

The grammar implements #1047 and some of #114.

andrewrk · 2018-10-27T15:31:40Z

Thanks for doing this work.

(See what changed in 0de46ff).

try (switch (c) {

This seems worse. Why is this necessary?

         if (base.id == (comptime typeToId(T))) {

Same question

     link_err: errorset{OutOfMemory}!void,

Can we keep error as the keyword that makes error sets, and use anyerror as the new global error set primitive type?

return HashInt(unsigned_x) ^ (comptime rng.scalar(HashInt));

Would this work? return HashInt(unsigned_x) ^ comptime(rng.scalar(HashInt));

pub const LPOVERLAPPED_COMPLETION_ROUTINE = ?(extern fn (DWORD, DWORD, *OVERLAPPED) void);

This seems worse. Why is this needed?

    assert(1234 == (switch (x) {
        MultipleChoice.A => 1,
        MultipleChoice.B => 2,
        MultipleChoice.C => u32(1234),
        MultipleChoice.D => 4,
    }));

Same question.

    if (t or (x: {
        assert(f);
        break :x f;
    })) {

Same question.

The 1 token lookahead goal might not be an interesting goal to reach, but we are really close

I agree that 1 token lookahead is not an important goal to reach. I will happily trade a couple token lookaheads for any other syntactic gain.

It would also be nice if we had the bison parser compiled on every commit, since bison can detect ambiguities through conflicts. Using the parser on all .zig files would also be a good idea, so that we ensure that stage1 and stage2 conform to the spec.

Let's keep the dependencies of Zig at a minimum, that is, a system c++ compiler, libLLVM, and libclang. However I would be open to a separate repository dedicated to testing Zig grammar, which we could have the CI use to run on every commit.

Hejsil · 2018-10-27T15:51:40Z

The reason switch and comptime require parens is that I gave them a high precedence, together with the other control flow expressions.

For comptime, it was done because of this mess:

// A lower precedence `comptime` Expr rule would cause this ambiguity:
async<comptime A> fn()void
//async<(comptime A> fn()void)
//async<(comptime A)> fn()void

I think we could also solve this with comptime (expr).

For switch and blocks, this was done to correctly formalise the rule that these statements shouldn't have semicolons behind them:

// Prev rules:
// Statement
//    : SwitchExpr
//    | Expr Semicolon
//    ...
// PrimaryExpr
//    : SwitchExpr
//    ...
// All these are valid, with the former grammar
switch (a) {}
switch (a) {};
{}
{};

// It would also cause this ambiguity
{}{}; // Is this two blocks, or a block followed by an initializer
switch (a) {}{}; // Same with switch

The requirement of parens around fn types is to resolve this:

fn()fn()void!void
//fn()(fn()void!void) // This is how it is parsed with the new grammar
//fn()(fn()void)!void

// This solution requires this grammar
// FnTypeExpr
//     : ErrorUnionExpr
//     | FnTypePrefix ErrorUnionExpr // Just noticed, that this ErrorUnionExpr should be a FnTypeExpr
// 
// ErrorUnionExpr
//     : PrefixExpr
//     | PrefixExpr ExclamationMark PrefixExpr
// 
// PrefixExpr
//     : SuffixExpr
//     | PrefixOp PrefixExpr // To have []fn()void, parens is required

I'll look more into relaxing these paren requirements. If you have any ideas, I'm all ears :)

Hejsil · 2018-10-27T15:55:36Z

I like the separate repo idea btw. Where do we keep the grammar? In the Zig or Zig-grammar repo?

Hejsil · 2018-10-27T15:57:42Z

Can we keep error as the keyword that makes error sets, and use anyerror as the new global error set primitive type?

We could, but what about error.SomeName?

andrewrk · 2018-10-27T15:59:20Z

Where do we keep the grammar?

In the separate repo I think. ziglang/zig is an implementation of the zig specification (which isn't written yet; see #75) using recursive descent, and the grammar repo would be a tool used for validating and testing the formal grammar specification.

We could, but what about error.SomeName?

We could make that continue to work with special syntax, yes? It's always been syntactic sugar for error{SomeName}.SomeName.

Hejsil · 2018-10-27T16:01:46Z

We could make that continue to work with special syntax, yes? It's always been syntactic sugar for error{SomeName}.SomeName.

Right, we could do that. Was trying to keep the Expr Dot Symbol to also handle this case, but I guess there is really no need for that. We can just have this PrimaryExpr: Keyword_error Dot Identifier | ...

winksaville · 2018-10-27T16:35:09Z

Since we're formalizing the grammar, I'd like to suggest allowing separators in numeric literals to improve readability.

I did some research: C++14 uses a single quote ', and other languages use an underscore _, including Rust, Java, Swift, Ruby, Perl, Python, and maybe others.

Of course there is at least one language where it was discussed and rejected, Go here and here.

ghost · 2018-10-27T17:19:07Z

async<comptime A> fn()void

'>' acts exactly like '{' in this case. In some places an expression cannot contain '{', because it always starts the function body. Here '>' always closes async. In both cases you can allow the use of parentheses to have the code parsed differently. You could always unify these simpler expressions, disallowing both '{' and '>' in both cases.

The same could be said about '[' and ']'. When inside '[', the next ']' does not apply to the current expression but to the parent. Zig does not have a ']' operator, but that's beside the point.

// A ']' operator would work fine
const a = b ] c;
const z = b[(c]d)];

async<comptime A> fn()void also works fine.

Maybe comptime(expr) would also work fine as was suggested. But I think the strategy above might be something to keep in mind when these issues pop up.

ghost · 2018-10-27T17:39:29Z

From the compiler writer's POV, I think it's really just about keeping track of what token stops the current expression and returns to the parent. Sometimes it's ']', sometimes it's ',', sometimes it's '{', sometimes it's '>'.
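A minimal sketch of that "stop token" idea, in Python rather than Zig. The mini-grammar, `BINARY_OPS`, `parse_atom` and `parse_expr` are all hypothetical names made up for illustration; this is not how either Zig parser is written. The caller passes down the set of tokens that end the current expression, and parentheses reset that set:

```python
# Hypothetical mini-grammar (not Zig's): atoms are names or parenthesised
# expressions, and every operator is binary and left-associative.
BINARY_OPS = {">", "+"}


def parse_atom(toks, i):
    """Parse a name or a parenthesised expression starting at toks[i]."""
    if toks[i] == "(":
        # Parentheses reset the stop set, so '>' is an operator again inside.
        node, i = parse_expr(toks, i + 1, frozenset())
        if i >= len(toks) or toks[i] != ")":
            raise SyntaxError("expected ')'")
        return node, i + 1
    return toks[i], i + 1


def parse_expr(toks, i, stop):
    """Parse `atom (op atom)*`, handing control back at any token in `stop`."""
    node, i = parse_atom(toks, i)
    while i < len(toks) and toks[i] in BINARY_OPS and toks[i] not in stop:
        op = toks[i]
        rhs, i = parse_atom(toks, i + 1)
        node = (op, node, rhs)
    return node, i
```

With an empty stop set, `a > b` parses as one comparison. With `stop={">"}` (roughly the position of an expression inside `async<...>`), parsing ends before the `>`, and `( a > b ) > c` still keeps the inner comparison because the parentheses reset the set.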

Hejsil · 2018-10-27T17:53:42Z

There are still a few ambiguities that I'm not sure how to fix:

// async<A> (fn()void) ()
// (async<A> fn()void) ()
_ = async<A> fn()void ();

'>' acts exactly like '{' in this case. In some places an expression cannot contain '{', because it always starts the function body. Here '>' always closes async. In both cases you can allow the use of parentheses to have the code parsed differently. You could always unify these simpler expressions, disallowing both '{' and '>' in both cases.

This is not a problem with <>. The problem still exists with this example:

// async (fn()void) ()
// (async fn()void) ()
_ = async fn()void ();

The grammar is ambiguous between an async call and an async function type.

ghost · 2018-10-27T17:54:16Z

Yes, my bad. I edited my answer.

Hejsil · 2018-10-27T18:01:49Z

@winksaville #504

ghost · 2018-10-27T19:12:10Z

I think this is a nice attempt at making the grammar context free but now that the grammar is simpler for machines it needs to be refined for humans. These things stick out to me as being very awkward:

// unintuitive parentheses around if expr
const A = B + (if (rem == 0) 0 else (os.page_size - rem));

// unintuitive parentheses around return expr, now no parentheses around if expr
const A = if (B >= T.bit_count) (return 0) else @intCast(Log2Int(T), abs_shift_amt);

// no parentheses around second return stmt, parentheses around first
if (b) |b_p| (return eql(a_p, b_p)) else |_| return false;

// parentheses around if stmt
pub const ChildProcess = struct {
    pub pid: (if (is_windows) void else i32),
}

I shortened some of the identifier names.

Hejsil · 2018-10-27T19:19:20Z

@UniqueID1
I agree that the parens around the return expressions are bad, though I think some enforcement of parens around if is not too bad of an idea (maybe not when it's alone, but for some expressions, it would definitely improve readability).

wirelyre · 2018-10-28T02:16:29Z

@UniqueID1 A language is a set of strings. In this case, the set of all syntactically legal Zig programs. A grammar is a structured way of representing a language.

One property of a language (but not a grammar) is whether it is context free. If you can write a grammar for Bison, then the language it parses is context free.

BNF grammars define languages unambiguously, but not parse trees. Bison emits a warning when there are two different ways to interpret the same input. This does not affect which inputs are accepted, but rather why they are accepted (that is, not the question "is this a legal program?", but rather "what is the structure of this program?").

It is not really enough to write a grammar, but let the resolution of ambiguous parses depend on a hand-written parser. Now you have to understand the parser program; and additionally, if the parser program does not correctly implement the language defined by the grammar, you're sunk.

ghost · 2018-10-28T03:49:57Z

@wirelyre Thanks. I deleted my comment.

Hejsil · 2018-10-28T12:22:51Z

Alright, here are the two ways we can choose for comptime and return to work together with the "block exprs":

Expr: ControlFlowExpr

ControlFlowExpr
    : if lparen Expr rparen ControlFlowExpr
    | if lparen Expr rparen ReturnExpr else ControlFlowExpr
    | ReturnExpr

ReturnExpr
    : return PrimaryExpr
    | PrimaryExpr

PrimaryExpr
    : lparen Expr rparen
    | num

// Valid code
// if (1) return 1 else return 1
// return (if (1) 1 else 1)

or

Expr: ReturnExpr

ReturnExpr
    : return ControlFlowExpr
    | ControlFlowExpr

ControlFlowExpr
    : if lparen Expr rparen ControlFlowExpr
    | if lparen Expr rparen PrimaryExpr else ControlFlowExpr
    | PrimaryExpr

PrimaryExpr
    : lparen Expr rparen
    | num

// Valid code
// if (1) (return 1) else (return 1)
// return if (1) 1 else 1

The simple grammar can't have both, because it is ambiguous:

Expr: ControlFlowExpr

ControlFlowExpr
    : if lparen Expr rparen ControlFlowExpr
    | if lparen Expr rparen ReturnExpr else ControlFlowExpr
    | ReturnExpr

ReturnExpr
    : return ControlFlowExpr
    | PrimaryExpr

PrimaryExpr
    : lparen Expr rparen
    | num

Derivation 1:

  0: Expr
  1: ControlFlowExpr
  2: if lparen Expr rparen ReturnExpr else ControlFlowExpr
  3: if lparen Expr rparen return ControlFlowExpr else ControlFlowExpr
  4: if lparen Expr rparen return if lparen Expr rparen ControlFlowExpr else ControlFlowExpr
  5: if lparen Expr rparen return if lparen Expr rparen ReturnExpr else ControlFlowExpr

Derivation 2:

  0: Expr
  1: ControlFlowExpr
  2: if lparen Expr rparen ControlFlowExpr
  3: if lparen Expr rparen ReturnExpr
  4: if lparen Expr rparen return ControlFlowExpr
  5: if lparen Expr rparen return if lparen Expr rparen ReturnExpr else ControlFlowExpr

We can special case certain things in the body of the if, to allow for return of simple expressions, but the more we special case, the harder it will be to explain the association and precedence of operators (if is basically a prefix operator).
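The two derivations can be checked mechanically. Below is a rough Python sketch (not part of the PR; `count_parses` and its helpers are hypothetical names) that counts the distinct parse trees the toy grammar admits for a token list. The `return_operand` parameter is an assumption added for illustration: it switches the operand of `return` between `ControlFlowExpr` (the ambiguous grammar) and `PrimaryExpr` (the first unambiguous variant).

```python
def count_parses(tokens, return_operand="ControlFlowExpr"):
    """Count the parse trees of `tokens` under the toy if/return grammar."""
    toks = tuple(tokens)
    rules = {}

    # Every parser maps a start index to {end index: number of parse trees}.
    def lit(t):
        return lambda i: {i + 1: 1} if i < len(toks) and toks[i] == t else {}

    def seq(*parsers):
        def parse(i):
            states = {i: 1}
            for parser in parsers:
                nxt = {}
                for pos, count in states.items():
                    for end, c in parser(pos).items():
                        nxt[end] = nxt.get(end, 0) + count * c
                states = nxt
            return states
        return parse

    def alt(*parsers):
        # Unordered choice: every alternative contributes (CFG semantics).
        def parse(i):
            out = {}
            for parser in parsers:
                for end, c in parser(i).items():
                    out[end] = out.get(end, 0) + c
            return out
        return parse

    def ref(name):
        # Late-bound, memoised reference so the rules can be mutually recursive.
        cache = {}
        def parse(i):
            if i not in cache:
                cache[i] = rules[name](i)
            return cache[i]
        return parse

    expr, cf = ref("Expr"), ref("ControlFlowExpr")
    ret, prim = ref("ReturnExpr"), ref("PrimaryExpr")
    cond = seq(lit("if"), lit("("), expr, lit(")"))

    rules["Expr"] = cf
    rules["ControlFlowExpr"] = alt(seq(cond, cf),
                                   seq(cond, ret, lit("else"), cf),
                                   ret)
    rules["ReturnExpr"] = alt(seq(lit("return"), ref(return_operand)), prim)
    rules["PrimaryExpr"] = alt(seq(lit("("), expr, lit(")")), lit("num"))

    # Only parses that consume the whole input count.
    return expr(0).get(len(toks), 0)
```

For `if ( num ) return if ( num ) num else num` the ambiguous grammar yields two trees, matching the two derivations; with `return_operand="PrimaryExpr"` the same input is rejected, because `return` then needs a parenthesised operand.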

Hejsil · 2018-10-29T07:46:55Z

I've thought long and hard about this, and I don't think we can make this grammar be context-free and unambiguous without sacrificing some of the niceness of the current syntactic constructs, or resorting to a special priority system outside the grammar (which kinda defeats the point).

I propose that we instead create the grammar as a Parsing expression grammar. The pros here are, that the grammar cannot be ambiguous as only the first matching rule will be considered. The cons to this approach are, that PEGs hide syntax flaws.
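To make "only the first matching rule will be considered" concrete, here is a small Python sketch of ordered choice on a nested if. The rule names (`if_else`, `if_only`, `atom`) and the token format are assumptions for illustration, not the actual Zig grammar. The choice tries `if_else` first and commits to the first alternative that succeeds, so the `else` ends up on the inner `if`:

```python
RESERVED = {"if", "else", "(", ")"}


def parse_expr(toks, i):
    """Expr <- IfElse / IfOnly / Atom  (ordered choice, first match wins)."""
    for alternative in (if_else, if_only, atom):
        result = alternative(toks, i)
        if result is not None:
            return result
    return None


def _if_head(toks, i):
    """'if' '(' Expr ')' Expr  ->  (cond, then, next index) or None."""
    if toks[i:i + 2] != ["if", "("]:
        return None
    cond = parse_expr(toks, i + 2)
    if cond is None or toks[cond[1]:cond[1] + 1] != [")"]:
        return None
    then = parse_expr(toks, cond[1] + 1)
    if then is None:
        return None
    return cond[0], then[0], then[1]


def if_else(toks, i):
    """IfElse <- 'if' '(' Expr ')' Expr 'else' Expr"""
    head = _if_head(toks, i)
    if head is None:
        return None
    cond, then, j = head
    if toks[j:j + 1] != ["else"]:
        return None
    els = parse_expr(toks, j + 1)
    if els is None:
        return None
    return ("if", cond, then, els[0]), els[1]


def if_only(toks, i):
    """IfOnly <- 'if' '(' Expr ')' Expr"""
    head = _if_head(toks, i)
    if head is None:
        return None
    cond, then, j = head
    return ("if", cond, then, None), j


def atom(toks, i):
    """Atom <- any single token that is not reserved."""
    if i < len(toks) and toks[i] not in RESERVED:
        return toks[i], i + 1
    return None
```

Parsing `if ( true ) if ( true ) 1 else 1` gives an outer `if` without an `else` whose body is the inner `if`/`else`. The other reading is never produced, which is also why a PEG can hide this kind of flaw rather than report it.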

Here is how I imagine all the current ambiguities will be parsed if we have a PEG:


/  FnDef   \
fn a() !b {}
       ^|
ErrorInf|
     TypeExpr

/          IfExpr          \
          /     IfExpr     \
if (true) if (true) 1 else 1

/   Async call    \
         /FnType\
async<A> fn()void()

/    Call    \
/Async call\
async<A> a()()

/   FnType    \
    /  FnType \
        /ErrUn\
fn()fn()void!u8

/         FnType         \
      /TypeExpr\
async<comptime A> fn()void;

// This one will be hard to parse correctly, since, with a recursive descent parser,
// we will start by parsing it as:
      /    TypeExpr      \
async<comptime A> fn()void;

// The parser should then revert its current parsing, and parse it correctly...
// I think we'll just parse this wrong until we have
// https://github.com/ziglang/zig/issues/1639

Also, we can't have the semicolon rule with PEG if we wanna keep the block expression having high priority:

if (true) {} // Valid
if (true) {}; // Valid, should be an error, but we can't have this without disallowing if {} in if expressions.
if (true) A;  // Valid
if (true) A  // error expected ;

thejoshwolfe · 2018-10-29T13:20:54Z

It seems like getting rid of <> as grouping operators would solve some problems. C++ has some nasty syntax due to > being both an infix operator and a group closing operator. Zig has the same problem with async.

I know the semantics of async are planned to change in the near future. Perhaps those changes will yield different syntax. I've always been uneasy with <> as grouping operators, so avoiding them entirely seems promising.

andrewrk · 2018-10-29T19:32:44Z

I've always been uneasy with <> as grouping operators, so avoiding them entirely seems promising.

In #661 we need a grouping operator to pass a calling convention expression to the fn keyword. Do you have a syntax suggestion for this?

andrewrk · 2018-10-29T23:02:47Z

I've thought long and hard about this, and I don't think we can make this grammar be context-free and unambiguous without sacrificing some of the niceness of the current syntactic constructs, or resorting to a special priority system outside the grammar (which kinda defeats the point).

I propose that we instead create the grammar as a Parsing expression grammar. The pros here are, that the grammar cannot be ambiguous as only the first matching rule will be considered. The cons to this approach are, that PEGs hide syntax flaws.

I think this is the way to move forward. 👍 from me. This is always how I imagined the grammar working. I believe that I have incorrectly been using the term "context-free grammar" when I only meant "you can make a parse tree without doing any semantic analysis".

This reverts commit 0de46ff.

Hejsil · 2018-10-30T19:41:49Z

The prefix async to async calls is really never gonna work.

Option 1:
SuffixExpr
    <- AsyncPrefix SuffixExpr FnCallArguments
     / PrimaryExpr SuffixOp*

const a = async t.t();
AsyncPrefix: "async" Ok
SuffixExpr: "t.t()" Ok
FnCallArguments: ";" failed
PrimaryExpr: "async" failed

Option 2:
SuffixExpr <- AsyncCallExpr SuffixOp*
AsyncCallExpr
    <- AsyncPrefix PrimaryExpr FnCallArguments
     / PrimaryExpr

const a = async t.t();
AsyncPrefix: "async" Ok
PrimaryExpr: "t" Ok
FnCallArguments: "." failed
PrimaryExpr: "async" failed

As we can see, both of these examples cannot be parsed without the parser having to be context aware.
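The Option 2 trace can be reproduced with a tiny matcher. This Python sketch is an assumption-heavy simplification: `PrimaryExpr` is reduced to a plain identifier, call arguments are always the empty `( )`, and all function names are hypothetical. It only mirrors the shape of the rules above:

```python
KEYWORDS = {"async"}


def primary_expr(toks, i):
    """PrimaryExpr <- identifier (simplified; keywords are not identifiers)."""
    if i < len(toks) and toks[i].isidentifier() and toks[i] not in KEYWORDS:
        return i + 1
    return None


def fn_call_arguments(toks, i):
    """FnCallArguments <- '(' ')'  (empty argument list only, for brevity)."""
    return i + 2 if toks[i:i + 2] == ["(", ")"] else None


def suffix_op(toks, i):
    """SuffixOp <- '.' identifier / FnCallArguments"""
    if toks[i:i + 1] == ["."]:
        return primary_expr(toks, i + 1)
    return fn_call_arguments(toks, i)


def async_call_expr(toks, i):
    """AsyncCallExpr <- 'async' PrimaryExpr FnCallArguments / PrimaryExpr"""
    if toks[i:i + 1] == ["async"]:
        j = primary_expr(toks, i + 1)
        if j is not None:
            k = fn_call_arguments(toks, j)
            if k is not None:
                return k
    # First alternative failed: backtrack and retry from the same position.
    return primary_expr(toks, i)


def suffix_expr(toks, i=0):
    """SuffixExpr <- AsyncCallExpr SuffixOp*"""
    j = async_call_expr(toks, i)
    if j is None:
        return None
    while True:
        k = suffix_op(toks, j)
        if k is None:
            return j
        j = k


def matches(src):
    """True if the whole whitespace-separated token string is a SuffixExpr."""
    toks = src.split()
    return suffix_expr(toks) == len(toks)
```

Under these assumptions `async t ( )` and `t . t ( )` match, but `async t . t ( )` does not: the first alternative stops at `.`, and the fallback `PrimaryExpr` cannot start at the `async` keyword.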

andrewrk · 2018-10-30T19:47:25Z

Can you elaborate a little? I'm focused on this copy elision stuff and so I'm not quite grokking your examples. In the current parser we look for a function call expression directly after async and it seems to work. Is there a problem with it?

Hejsil · 2018-10-30T19:49:58Z

@andrewrk What we do in both parsers, is look at what type the resulting expression of SuffixExpr is. If it's a call, then we add the async info to this node. This is parsing based on context.
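Roughly, that check looks like the following Python sketch. The tuple-based nodes and the names `parse_suffix` and `parse_async` are hypothetical, and neither stage1 nor stage2 is written this way; the point is only that the "is it a call?" test runs on the finished node and not inside a grammar rule:

```python
def parse_suffix(toks, i):
    """Parse `ident ('.' ident | '(' ')')*` into nested tuples."""
    node = ("ident", toks[i])
    i += 1
    while i < len(toks):
        if toks[i] == "." and i + 1 < len(toks):
            node = ("field", node, toks[i + 1])
            i += 2
        elif toks[i:i + 2] == ["(", ")"]:
            node = ("call", node, False)  # third slot: is_async
            i += 2
        else:
            break
    return node, i


def parse_async(toks, i=0):
    """Parse an optional `async` prefix followed by a suffix expression."""
    if toks[i] != "async":
        return parse_suffix(toks, i)
    node, j = parse_suffix(toks, i + 1)
    # The context-dependent step: inspect what kind of node came back.
    if node[0] != "call":
        raise SyntaxError("expected a function call after 'async'")
    return ("call", node[1], True), j
```

Here `async t . t ( )` is accepted and the outermost call node is flagged as async, while `async t . t` raises a `SyntaxError`.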

Hejsil · 2018-10-30T19:50:48Z

What I show in my examples is how the PEG grammar would parse the input: it fails, because we cannot express this check in the grammar.

Hejsil · 2018-11-12T14:43:12Z

I have all stage 1 tests passing locally. As I mentioned, we still have these being parsed incorrectly:

fn a() if (true) A {}
fn a() while (true) A {}
fn a() for ([]void{}) A {}
fn a() comptime A {}
fn a() break A {}
fn a() cancel A {}
fn a() resume A {}
fn a() return A {}

I can trivially make the break, cancel, resume and return cases work, as they can never return a type, so they can get a precedence above TypeExpr.

As for comptime, if, for, while, I think the plan is to have two syntaxes: one for an if type expression and one for a normal if expression.

Also, in the grammar, if (true) {}; is a valid statement, but in my implemented parser, it is not. This is because of PEG's unlimited lookahead, whereas in my implementation, each rule only has N lookahead, where N doesn't scale with input.

For now, I'm gonna do the least amount of effort to make stage 2 able to parse the new anyerror syntax, and then leave it be. It's gonna get rewritten at some point anyway, so it can implement the new grammar when that happens.

Hejsil · 2018-11-13T08:25:46Z

Also, in the grammar, if (true) {}; is a valid statement, but in my implemented parser, it is not. This is because of PEG's unlimited lookahead, whereas in my implementation, each rule only has N lookahead, where N doesn't scale with input.

This is not correct. I just found out that it's not valid and my grammar does the semicolon rule correctly (from the tests I've done). Hooray!

Hejsil · 2018-11-13T09:33:35Z

I have all stage 1 tests passing locally. As I mentioned, we still have these being parsed incorrectly:

fn a() if (true) A {}
fn a() while (true) A {}
fn a() for ([]void{}) A {}
fn a() comptime A {}
fn a() break A {}
fn a() cancel A {}
fn a() resume A {}
fn a() return A {}

I can trivially make the break, cancel, resume and return cases work, as they can never return a type, so they can get a precedence above TypeExpr.

As for comptime, if, for, while, I think the plan is to have two syntaxes: one for an if type expression and one for a normal if expression.

Implemented!

Hejsil · 2018-11-13T10:23:11Z

I have all tests passing locally. This is ready to be merged once CI finishes running.

If anyone wishes for a zig fmt pass that can convert 0.3.0 code -> new anyerror syntax or one that reverts the struct.{} syntax, I can make one, but I don't really want to if no one is gonna use it :)

I'll make a repo for the grammar soon, and find a way to have it built into the docs.

wqweto · 2018-11-13T12:33:55Z

Can confirm that I can parse all .zig sources in this PR (incl all tests) with parser.y grammar as input to my (independent) vbpeg parser generator which completely understands Ian's peg/leg dialect for PEGs.

If time permits will annotate my fork of the grammar (in vbpeg tests) with native language actions to produce some kind of parse tree in JSON format for reference.

Hejsil · 2018-11-13T12:44:44Z

I ended up making the fmt passes, as I needed them myself:

zig-fmt-error-to-anyerror: converts error -> anyerror
zig-fmt-revert-dot-init-container-decl: reverts "Solve the return type ambiguity" (#1628)

gernest · 2018-11-13T13:45:15Z

@Hejsil can you please help me understand how I can use zig-fmt-error-to-anyerror and zig-fmt-revert-dot-init-container-decl?

This change breaks a lot of stuff I have been working on, so an automatic conversion will be helpful.
Thanks.

Hejsil · 2018-11-13T13:46:22Z

You will need to build the stage 2 compiler from source from these branches. A guide for that is in the README.

gernest · 2018-11-13T14:03:46Z

thanks

Hejsil added 2 commits October 27, 2018 11:36

Reverted #1628

e5a8135

Changed code in places where the new syntax doesn't compile

0de46ff

Hejsil added the work in progress This pull request is not ready for review yet. label Oct 27, 2018

Hejsil and others added 2 commits October 30, 2018 14:08

Revert "Changed code in places where the new syntax doesn't compile"

42a1bb2

This reverts commit 0de46ff.

Revert "Changed code in places where the new syntax doesn't compile"

37b48dc

This reverts commit 0de46ff.

Hejsil added 3 commits November 9, 2018 16:07

Code not compiling after merge

3824930

Fixed some tests, line numbering and percedence issues

36d32f6

All stage1 tests succeeds!

a50a7dd

Implemented the last syntax changes for if/for/while in return type

2640719

Hejsil added 3 commits November 13, 2018 11:10

Implemented anyerror syntax in stage2

ab9408f

zig fmt

605e870

Fixed last tests failing

3f7cb65

Updated comment to reflect grammar

9adaea8

Hejsil removed the work in progress This pull request is not ready for review yet. label Nov 13, 2018

Hejsil merged commit 8139c5a into master Nov 13, 2018

Hejsil deleted the new-grammar branch November 13, 2018 13:08

This was referenced Nov 15, 2018

Update to lastest zig with new ContainerDecl syntax. tiehuis/zig-sdl2#2

Closed

Changes for return type ambiguity breaking modification to the compil… tiehuis/zig-rosetta#1

Closed

This was referenced Mar 14, 2019

Syntax error in generated code inside zig-cache (translate-c?) #2043

Closed

workaround for #2043 #2068

Merged
